System Management in the BlueGene/L Supercomputer
Abstract
The BlueGene/L supercomputer will use system-on-a-chip integration and a highly scalable cellular architecture to deliver 360 teraflops of peak computing power. With 65,536 compute nodes, BlueGene/L represents a new level of scalability for parallel systems, and many scalability challenges naturally arise. In this paper, we discuss challenges in the area of system management and control, including machine booting, software installation, user account management, system monitoring, and job execution. We address scalability by organizing the system hierarchically. The 65,536 compute nodes are organized into 1,024 clusters of 64 compute nodes each, called processing sets. Each processing set is under the control of a 65th node, called an I/O node. The 1,024 processing sets can then be managed to a great extent as a regular Linux cluster, of which there are several successful examples. Regular cluster management is complemented by BlueGene/L-specific services, performed by a service node over a separate control network. Our software development and experiments have so far been conducted in architecturally accurate simulators of BlueGene/L, and we are gearing up to test real prototypes in 2003.
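The hierarchy described in the abstract can be illustrated with a little arithmetic. The following is only a minimal sketch: the node counts come from the abstract, but the function names and the contiguous numbering of compute nodes within processing sets are assumptions made for illustration, not the actual BlueGene/L node addressing scheme.

```python
# Figures taken from the abstract.
COMPUTE_NODES = 65_536
NODES_PER_PSET = 64  # compute nodes per processing set


def pset_count(compute_nodes=COMPUTE_NODES, per_pset=NODES_PER_PSET):
    """Number of processing sets (each one managed by a single I/O node)."""
    if compute_nodes % per_pset != 0:
        raise ValueError("compute nodes must divide evenly into processing sets")
    return compute_nodes // per_pset


def pset_of(node_id, per_pset=NODES_PER_PSET):
    """Processing set index for a compute node.

    ASSUMPTION: compute nodes are numbered contiguously from 0, so that
    nodes 0..63 fall in set 0, nodes 64..127 in set 1, and so on. The real
    machine may number nodes differently.
    """
    if not 0 <= node_id < COMPUTE_NODES:
        raise ValueError("node_id out of range")
    return node_id // per_pset


def total_nodes():
    """Compute nodes plus one I/O node per processing set."""
    return COMPUTE_NODES + pset_count()


if __name__ == "__main__":
    print(pset_count())    # processing sets, i.e. I/O nodes
    print(total_nodes())   # compute nodes + I/O nodes
    print(pset_of(64))     # set index under the assumed numbering
```

Under these assumptions, a management tool would only need to address the 1,024 I/O nodes directly, which is what lets the machine be treated largely as a regular Linux cluster.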
Similar resources
Implementing Optimized Collective Communication Routines on the IBM BlueGene/L Supercomputer
BlueGene/L is a massively parallel supercomputer that is currently the fastest in the world. Implementing MPI, and especially fast collective communication operations, can be challenging on such an architecture. In this paper, I will present optimized implementations of MPI collective algorithms on the BlueGene/L supercomputer and show performance results compared to the default MPICH2 algorithm...
Extracting Message Types from BlueGene/L’s Logs
In this paper we present results on extracting message types from the BlueGene/L supercomputer logs using the IPLoM (Iterative Partitioning Log Mining) algorithm. Previous work indicates that IPLoM shows promise as a message type extraction algorithm. We compared the results of IPLoM against message types produced manually on the BlueGene/L data. To provide a baseline of ...
Obtaining Hardware Performance Metrics for the BlueGene/L Supercomputer
Hardware performance monitoring is the basis of modern performance analysis tools for application optimization. We are interested in providing such performance analysis tools for the new BlueGene/L supercomputer as early as possible, so that applications can be tuned for that machine. We are faced with two challenges in achieving that goal. First, the machine is still going through its final de...
Implementing MPI on the BlueGene/L Supercomputer
The BlueGene/L supercomputer will consist of 65,536 dual-processor compute nodes interconnected by two high-speed networks: a three-dimensional torus network and a tree topology network. Each compute node can only address its own local memory, making message passing the natural programming model for BlueGene/L. In this paper we present our implementation of MPI for BlueGene/L. In particular, we...
The BlueGene/L Supercomputer and Quantum ChromoDynamics
We describe our methods for performing quantum chromodynamics (QCD) simulations that sustain up to 20% of the peak performance on BlueGene supercomputers. We present our methods, scaling properties, and first cutting edge results relevant to QCD. We show how this enables unprecedented computational scale that brings lattice QCD to the next generation of calculations. We present our QCD simulati...
Publication date: 2003